• Prompt Auto-Editing (PAE) is a method for improving text-to-image generation with diffusion models such as Imagen and Stable Diffusion. Instead of relying on manual prompt engineering, it automatically refines text prompts by dynamically adjusting the weight and injection time range of individual words, with the editing policy trained through online reinforcement learning.
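To make the idea concrete, below is a minimal, illustrative sketch (not the authors' code) of one way a per-token action of this kind could be applied: each prompt token is assigned a weight and an injection window over the denoising schedule, and its text embedding is scaled or withheld accordingly. The function and variable names are assumptions for illustration, not PAE's actual API.

```python
# Illustrative sketch only: a PAE-style action per prompt token can be read as a
# (weight, injection interval) pair chosen by an RL policy. Here we only show how
# such edits could be applied to per-token text embeddings at a denoising step.
import numpy as np

def apply_token_edits(token_embeddings, weights, injection_windows, t, t_max):
    """Scale each token embedding by its weight and zero it outside its
    injection window. `t` is the current denoising step, `t_max` the total.
    All names here are illustrative assumptions, not PAE's interface."""
    edited = token_embeddings.copy()
    progress = t / t_max
    for i, (w, (start, end)) in enumerate(zip(weights, injection_windows)):
        if start <= progress <= end:
            edited[i] *= w          # emphasize or de-emphasize the token
        else:
            edited[i] = 0.0         # token not injected at this step
    return edited

# Toy usage: 4 tokens with 8-dimensional embeddings.
emb = np.random.randn(4, 8)
weights = [1.0, 1.3, 0.8, 1.0]                              # per-token emphasis
windows = [(0.0, 1.0), (0.0, 0.5), (0.3, 1.0), (0.0, 1.0)]  # fractions of denoising
edited = apply_token_edits(emb, weights, windows, t=10, t_max=50)
print(edited.shape)
```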

    Tuesday, April 9, 2024
  • ComfyGen introduces a novel approach to text-to-image generation built around prompt-adaptive workflows. It responds to a shift in the user community away from simple, monolithic models toward complex workflows that combine specialized components. Such workflows can significantly enhance image quality, but building them requires considerable expertise because of the sheer number of available components and their intricate interdependencies. The core innovation of ComfyGen is to automate workflow generation, tailoring the workflow to each user prompt. This is done with two large language model (LLM) baselines: a tuning-based method that learns from user-preference data, and a training-free method in which the LLM selects from existing workflows. Both improve image quality over monolithic models and over generic workflows that do not adapt to the prompt.

    The implementation is built around ComfyUI, an open-source tool for creating and executing text-to-image pipelines. ComfyUI pipelines are stored as JSON, a format well suited to LLM prediction. To gather training data, a collection of human-created ComfyUI workflows is augmented by randomly altering parameters such as the base model, LoRAs, samplers, and other settings. A set of 500 prompts is then rendered with each workflow, and the resulting images are scored for aesthetic appeal and predicted human preference, yielding a dataset of (prompt, flow, score) triplets (a sketch of this step follows below).

    On top of this data, the two baselines work as follows. The in-context method provides the LLM with a table of workflows and their corresponding scores and asks it to select the most suitable one for a new prompt (see the second sketch below); the fine-tuning method trains the LLM on input prompts and scores to predict the workflow expected to achieve high-quality results. Comparative evaluations show that ComfyGen outperforms both monolithic models and fixed, prompt-independent workflows on human-preference and prompt-alignment benchmarks, and user studies along with established benchmarks such as GenEval further validate the approach. In summary, ComfyGen automates the creation of prompt-tailored workflows that enhance image quality, providing a new avenue for improving the user experience in text-to-image generation.
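The first sketch below illustrates the data-collection step described above: start from a human-made workflow, randomly swap a few parameters, render prompts with each variant, and record (prompt, flow, score) triplets. The workflow dictionary is a simplified stand-in rather than the real ComfyUI JSON schema, and the scoring function is a placeholder for the aesthetic and human-preference scorers; all names are assumptions.

```python
# Hedged sketch of (prompt, flow, score) collection; not ComfyGen's actual code.
import json
import random

BASE_MODELS = ["sdxl_base.safetensors", "juggernaut_xl.safetensors"]
LORAS = [None, "detail_tweaker.safetensors"]
SAMPLERS = ["euler", "dpmpp_2m"]

def sample_workflow(seed_flow):
    """Randomly perturb a seed workflow (base model, LoRA, sampler, steps)."""
    flow = json.loads(json.dumps(seed_flow))              # deep copy via JSON round-trip
    flow["checkpoint"]["ckpt_name"] = random.choice(BASE_MODELS)
    flow["lora"]["lora_name"] = random.choice(LORAS)
    flow["sampler"]["sampler_name"] = random.choice(SAMPLERS)
    flow["sampler"]["steps"] = random.choice([20, 30, 40])
    return flow

def score_image(prompt, flow):
    # Placeholder for the aesthetic / human-preference models used for scoring.
    return random.random()

seed_flow = {
    "checkpoint": {"ckpt_name": "sdxl_base.safetensors"},
    "lora": {"lora_name": None},
    "sampler": {"sampler_name": "euler", "steps": 30, "cfg": 7.0},
}
prompts = ["a watercolor fox in a misty forest"]          # stands in for the 500 prompts

dataset = []
for _ in range(3):                                        # a few augmented flows
    flow = sample_workflow(seed_flow)
    for p in prompts:
        dataset.append({"prompt": p, "flow": flow, "score": score_image(p, flow)})
print(len(dataset), "triplets")
```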
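The second sketch illustrates the training-free, in-context variant: the LLM is shown a small table of candidate flows with the scores they achieved, and asked to pick one for a new prompt. The table format and field names are assumptions for illustration, not ComfyGen's actual prompt template, and no real LLM call is made here.

```python
# Illustrative sketch of in-context workflow selection; the prompt layout below
# is an assumption, not the paper's template.
def build_selection_prompt(new_prompt, flow_table):
    rows = "\n".join(
        f"{row['flow_id']}\t{row['tags']}\t{row['score']:.2f}" for row in flow_table
    )
    return (
        "You are given text-to-image workflows and their quality scores.\n"
        "flow_id\ttags\tscore\n"
        f"{rows}\n\n"
        f"User prompt: {new_prompt}\n"
        "Answer with the single flow_id best suited to this prompt."
    )

flow_table = [
    {"flow_id": "flow_07", "tags": "photorealistic portraits", "score": 0.81},
    {"flow_id": "flow_12", "tags": "anime, flat shading", "score": 0.74},
    {"flow_id": "flow_19", "tags": "landscapes, high detail", "score": 0.78},
]
print(build_selection_prompt("a foggy mountain lake at sunrise", flow_table))
```

The fine-tuned variant would instead train on the collected triplets directly, conditioning on the prompt (and a target score) and emitting the workflow JSON as the prediction target.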